
Fix autoscaler returning fatal error on GOAWAY if no scaling seen #973


Merged
merged 3 commits into master from fix-fatal-autoscale-poll-err on Aug 15, 2025

Conversation

Sushisource
Member

What was changed

Fix autoscaling pollers possibly treating a GOAWAY/connection-closed error as fatal when they shouldn't.

Why?

We don't want workers to die in this situation.

Checklist

  1. Closes [Bug] Temporal Worker RuntimeError: Poll failure: Unhandled grpc error when polling #964

  2. How was this tested:
    Added unit tests, and verified with a long-running integration test. Unfortunately I could not find any reasonable way to simulate this exact GOAWAY error with the fake in-memory gRPC server I have, even after a lot of research.

  3. Any docs updates needed?

@Sushisource Sushisource requested a review from a team as a code owner August 9, 2025 00:51
}

async fn fake_server<F>(response_maker: F) -> FakeServer
where
-    F: FnMut() -> Response<Body> + Clone + Send + Sync + 'static,
+    F: FnMut() -> BoxFuture<'static, Response<Body>> + Clone + Send + Sync + 'static,
Member Author


I made this change while trying to get the fake server to produce the GOAWAY. I couldn't do it, but it's probably useful for the fake responses to be async anyway.
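
For illustration only (this is not code from the PR): with the new `FnMut() -> BoxFuture<'static, Response<Body>>` bound, the response maker can await inside the closure. The sketch below assumes tokio, futures-util, and the http crate, substitutes a plain `String` body for the real `Body` type, and uses a hypothetical `takes_response_maker` as a stand-in for the fake server's constructor.

```rust
use std::time::Duration;

use futures_util::future::{BoxFuture, FutureExt};
use http::Response;

// Hypothetical stand-in for the fake server's constructor: it only checks that
// the closure satisfies the new bound shown in the diff above.
fn takes_response_maker<F>(_response_maker: F)
where
    F: FnMut() -> BoxFuture<'static, Response<String>> + Clone + Send + Sync + 'static,
{
}

#[tokio::main]
async fn main() {
    // The closure boxes an async block, so a test can await inside it (sleep,
    // coordinate with another task, etc.) before producing the fake response.
    let mut maker = move || -> BoxFuture<'static, Response<String>> {
        async {
            tokio::time::sleep(Duration::from_millis(10)).await;
            Response::new("fake grpc response".to_string())
        }
        .boxed()
    };

    // Calling the maker yields a future; awaiting that yields the response.
    let resp = maker().await;
    assert_eq!(resp.body(), "fake grpc response");

    takes_response_maker(maker);
}
```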

@Sushisource Sushisource force-pushed the fix-fatal-autoscale-poll-err branch from 6efa35e to 68a9191 Compare August 9, 2025 00:58
Comment on lines +555 to +561
// Only propagate errors out if they weren't because of the short-circuiting
// logic. IE: We don't want to fail callers because we said we wanted to know
// about ResourceExhausted errors, but we haven't seen a scaling decision yet,
// so we're not reacting to errors, only propagating them.
return !e
    .metadata()
    .contains_key(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT);
Member Author


This is the fix

Member


Can I get some background here? I expect Tonic to hide GOAWAY and implicitly handle reconnects. We send a GOAWAY every 5m on average. Is there a situation where regular non-worker client usage may see a GOAWAY?

Member Author


There's a bug in hyper or tonic that was worked around here: #811

But the short-circuit that the autoscaler turns on (so it can scale better on otherwise non-fatal errors) makes this error surface as fatal, so it gets ignored specifically here.
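
For anyone following along, a rough sketch of that mechanism (illustrative, not the actual sdk-core code): the short-circuit path tags the `tonic::Status` it passes through with a metadata key, and the fatal-error predicate (the highlighted snippet above) skips statuses carrying that tag. The key string and the helper function names below are assumptions.

```rust
use tonic::{Code, Status};

// Hypothetical metadata key; the real constant lives in sdk-core's poller
// code and its exact string value is not shown in this PR excerpt.
const ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT: &str = "short-circuited-error";

// When the autoscaler's short-circuit passes an otherwise-retryable error out
// to the caller (because no scaling decision has been seen yet), it tags the
// status so later checks know the error was only propagated, not fatal.
fn tag_short_circuited(mut status: Status) -> Status {
    status
        .metadata_mut()
        .insert(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT, "true".parse().unwrap());
    status
}

// The fatal-error predicate (as in the snippet above) then ignores tagged
// statuses, so a GOAWAY/"connection closed" surfaced this way doesn't kill
// the worker.
fn is_fatal(status: &Status) -> bool {
    !status
        .metadata()
        .contains_key(ERROR_RETURNED_DUE_TO_SHORT_CIRCUIT)
}

fn main() {
    let goaway = Status::new(Code::Cancelled, "connection closed");
    assert!(is_fatal(&goaway));
    assert!(!is_fatal(&tag_short_circuited(goaway)));
}
```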

Member

@cretz, Aug 14, 2025


(marking approved anyways, just trying to understand some strangeness here)

So IIUC the server by default sends a GOAWAY after 5m (i.e. a soft connection close telling you to stop making new calls once whatever is in flight completes) and then hard-closes the TCP connection after 7m (because 2m between the soft and hard close should be enough for a properly behaving client to never hit this).

So somehow we're sending RPC calls even after the soft close and therefore hitting the 7m limit? If that's the case, there can be rare/racy data loss if the hard TCP close occurs during a gRPC call (in our case it'd be a task timeout, because the server might send us a task and then close the connection). Or maybe Tonic is the one eagerly returning Code::Cancelled with "connection closed" before it ever even makes the call? Obviously not important if it all works today, but it is a bit confusing to me.

Member Author


I think the latter is the explanation, but yeah, I agree it's confusing.

@Sushisource Sushisource force-pushed the fix-fatal-autoscale-poll-err branch from 68a9191 to a96360b Compare August 14, 2025 18:22
@Sushisource Sushisource merged commit 148774c into master Aug 15, 2025
16 of 18 checks passed
@Sushisource Sushisource deleted the fix-fatal-autoscale-poll-err branch August 15, 2025 21:15